The Annals of Applied Probability 

2007, Vol. 17, No. 4, 1222-1244 

DOI: 10.1214/105051607000000096 

© Institute of Mathematical Statistics, 2007 

WEAK CONVERGENCE OF METROPOLIS ALGORITHMS 
FOR NON-I.I.D. TARGET DISTRIBUTIONS

By Mylene Bedard 

University of Warwick 

In this paper, we shall optimize the efficiency of Metropolis algo- 
rithms for multidimensional target distributions with scaling terms 
possibly depending on the dimension. We propose a method for de- 
termining the appropriate form for the scaling of the proposal distri- 
bution as a function of the dimension, which leads to the proof of an 
asymptotic diffusion theorem. We show that when there does not ex- 
ist any component with a scaling term significantly smaller than the 
others, the asymptotically optimal acceptance rate is the well-known 
0.234. 

1. Introduction. Metropolis algorithms [8, 9] provide a method for sam- 
pling from highly complex probability distributions. The ease of implemen- 
tation and wide applicability of these algorithms have given them their pop- 
ularity and they are now frequently used by practitioners at all levels in 
various fields of application. However, their convergence can sometimes be 
lengthy, which suggests the need for an optimization of their performance. 
Because the efficiency of Metropolis algorithms depends crucially on the 
scaling of the proposal density chosen for their implementation, it is funda- 
mental to judiciously choose this parameter. 

Informal guidelines for the optimal scaling problem have been proposed 
by, among others, [3] and [4], but the first theoretical results were obtained 
by [11]. In particular, the authors considered d-dimensional target distributions with i.i.d. components and studied the asymptotic behavior (as $d \to \infty$) of Metropolis algorithms with Gaussian proposals. It was proven that under some regularity conditions for the target distribution, the asymptotic acceptance rate should be tuned to be approximately 0.234 for optimal performance of the algorithm. It was also shown that the correct proposal scaling is of the form $\ell^2/d$ for some constant $\ell$ as $d \to \infty$. The simplicity of the obtained asymptotically optimal acceptance rate (AOAR) makes these theoretical results extremely useful in practice. Optimal scaling issues have been explored by other authors, namely [5, 6, 10, 12, 13].

In this paper, we carry out a similar study for d-dimensional target dis- 
tributions with independent components. The particularity of our model is 
that the scaling term of each component is allowed to depend on the dimen- 
sion of the target distribution, which constitutes a critical distinction with 
the i.i.d. case. We provide a condition under which the algorithm admits 
the same limiting diffusion process and the same AOAR as those found in 
[11]. This is achieved, in the first place, by determining the appropriate form 
for the proposal scaling as a function of d, which is now different from the 
i.i.d. case. Then, by verifying $L^1$ convergence of generators, we prove that the sequence of stochastic processes formed by, say, the $i^*$th component of each Markov chain (appropriately rescaled) converges to a Langevin diffusion process with a certain speed measure. Obtaining the AOAR is thus a simple matter of optimizing the speed measure of the diffusion.

The paper is structured as follows. In Section 2, we describe the Metropo- 
lis algorithm and introduce the target distribution setting. The main results 
are presented in Section 3, along with a discussion concerning inhomoge- 
neous proposal distributions and some extensions. We prove the theorems 
in Section 4 using lemmas proved in Sections 5 and 6, finally concluding the 
paper with a discussion. 

2. Sampling from the target distribution. 

2.1. The Metropolis algorithm. The idea behind the Metropolis algorithm is to generate a Markov chain $X_0, X_1, \ldots$ having the target distribution as a stationary distribution. In particular, suppose that $\pi$ is a d-dimensional probability density of interest with respect to Lebesgue measure. Also, let the proposed moves be normally distributed around $x$, that is, $N(x, \sigma^2 I_d)$ for some $\sigma^2$ and where $I_d$ is the d-dimensional identity matrix. The Metropolis algorithm thus proceeds as follows. Given $X_t$, the state of the chain at time $t$, a value $Y_{t+1}$ is generated from the normal density $q(X_t, y)\,dy$. The probability of accepting the proposed value $Y_{t+1}$ as the new value for the chain is $\alpha(X_t, Y_{t+1})$, where

\[ \alpha(x, y) = \begin{cases} \min\bigl(1, \pi(y)/\pi(x)\bigr), & \pi(x)q(x,y) > 0, \\ 1, & \pi(x)q(x,y) = 0. \end{cases} \]

If the proposed move is accepted, the chain jumps to $X_{t+1} = Y_{t+1}$; otherwise, it stays where it is and $X_{t+1} = X_t$.

In order to have some level of optimality in the performance of the algorithm, care must be exercised when choosing $\sigma^2$. If it is too small, the proposed jumps will be too short and, in spite of a very high acceptance rate, the simulation will move very slowly toward the target distribution. At the opposite extreme, a large scaling value will generate jumps into low target density regions, resulting in the rejection of the proposed moves and a chain that stands still most of the time.

Before finding an appropriate value for $\sigma^2$ between these extremes, we first define a criterion which is closely related to the algorithm efficiency. The notion of $\pi$-average acceptance rate is defined in [11] as $E[1 \wedge \pi(Y)/\pi(X)] = \iint \pi(x)\alpha(x,y)q(x,y)\,dx\,dy$ for the d-dimensional Metropolis algorithm.
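The algorithm of Section 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the target is taken to be a product of scaled standard normal densities, and the names `log_target` and `metropolis` are hypothetical helpers introduced here, not part of the paper. When the chain starts in stationarity, the returned fraction of accepted proposals is a Monte Carlo estimate of the $\pi$-average acceptance rate.

```python
import math
import random


def log_target(x, theta):
    """Log of prod_j theta_j * f(theta_j * x_j), with f ASSUMED standard normal."""
    const = -0.5 * math.log(2.0 * math.pi)
    return sum(math.log(t) + const - 0.5 * (t * xj) ** 2 for t, xj in zip(theta, x))


def metropolis(theta, sigma, n_iter, seed=0):
    """Random-walk Metropolis with N(x, sigma^2 I_d) proposals.

    Returns the final state and the fraction of accepted proposals.
    """
    rng = random.Random(seed)
    # Start in stationarity: X_j = Z_j / theta_j with Z_j standard normal.
    x = [rng.gauss(0.0, 1.0) / t for t in theta]
    log_px = log_target(x, theta)
    accepted = 0
    for _ in range(n_iter):
        y = [xj + sigma * rng.gauss(0.0, 1.0) for xj in x]
        log_py = log_target(y, theta)
        # Accept with probability min(1, pi(y) / pi(x)); 1 - random() lies in (0, 1].
        if math.log(1.0 - rng.random()) < log_py - log_px:
            x, log_px = y, log_py
            accepted += 1
    return x, accepted / n_iter


if __name__ == "__main__":
    d = 50
    _, rate = metropolis([1.0] * d, sigma=2.38 / math.sqrt(d), n_iter=4000)
    print(f"observed acceptance rate: {rate:.3f}")
```

With a very small `sigma` almost every proposal is accepted but the chain barely moves, and with a very large `sigma` almost every proposal is rejected, which is the trade-off described above.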

2.2. The target distribution. Consider the following d-dimensional target density:

\[ \pi(d, \mathbf{x}^{(d)}) = \prod_{j=1}^{d} \theta_j(d)\, f(\theta_j(d) x_j). \tag{1} \]

We impose the following regularity conditions on the density $f$: $f$ is a positive function and $(\log f(x))'$ is Lipschitz continuous. We also suppose that $E[(f'(X)/f(X))^4] = \int_{\mathbb{R}} (f'(x)/f(x))^4 f(x)\,dx < \infty$ and $E[(f''(X)/f(X))^2] < \infty$.

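The d target components, although independent, are not identically distributed. In particular, we consider the case where the scaling terms $\theta_j^{-2}(d)$, $j = 1, \ldots, d$, take the following form:

\[ \Theta^{-2}(d) = \Bigl( \frac{K_1}{d^{\lambda_1}}, \ldots, \frac{K_n}{d^{\lambda_n}}, \underbrace{\frac{K_{n+1}}{d^{\gamma_1}}, \ldots, \frac{K_{n+1}}{d^{\gamma_1}}}_{c(\mathcal{J}(1,d))}, \ldots, \underbrace{\frac{K_{n+m}}{d^{\gamma_m}}, \ldots, \frac{K_{n+m}}{d^{\gamma_m}}}_{c(\mathcal{J}(m,d))} \Bigr). \]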

Ultimately, we shall be interested in the limit of the target distribution as $d \to \infty$. Let $n < \infty$ denote the number of components whose scaling term appears a finite number of times in the limit of $\Theta^{-2}(d)$. Also, let the $j$th of these $n$ scaling terms be $K_j/d^{\lambda_j}$, $j = 1, \ldots, n$, where $\lambda_j \in (-\infty, \infty)$ and $K_j$ is some positive and finite constant. Similarly, let $0 < m < \infty$ denote the number of different scaling terms appearing infinitely often in the limit. These $m$ scaling terms are taken to be $K_{n+i}/d^{\gamma_i}$, $i = 1, \ldots, m$, with $\gamma_i \in (-\infty, \infty)$. For now, we assume the constants $0 < K_{n+i} < \infty$ to be the same for all scaling terms within each of the $m$ groups. We shall relax this assumption in Section 3.2.

For $i = 1, \ldots, m$, define the sets $\mathcal{J}(i,d) = \{ j \in \{1, \ldots, d\};\ \theta_j^{-2}(d) = K_{n+i}/d^{\gamma_i} \}$. The $i$th set thus contains positions of components with a scaling term equal to $K_{n+i}/d^{\gamma_i}$. These sets are such that $\bigcup_{i=1}^{m} \mathcal{J}(i,d) = \{n+1, \ldots, d\}$.

Since each of the $m$ groups of scaling terms might occupy different proportions of $\Theta^{-2}(d)$, we also define the cardinality of the sets $\mathcal{J}(i,d)$:

\[ c(\mathcal{J}(i,d)) = \#\bigl\{ j \in \{1, \ldots, d\};\ \theta_j^{-2}(d) = K_{n+i}/d^{\gamma_i} \bigr\}, \quad i = 1, \ldots, m, \tag{2} \]

where $c(\mathcal{J}(i,d))$ is assumed to be some polynomial function of the dimension satisfying $\lim_{d\to\infty} c(\mathcal{J}(i,d)) = \infty$.

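It will be convenient to rearrange the terms of $\Theta^{-2}(d)$ so that all of the different scaling terms appear at one of the first $n + m$ positions:

\[ \Theta^{-2}(d) = \Bigl( \frac{K_1}{d^{\lambda_1}}, \ldots, \frac{K_n}{d^{\lambda_n}}, \frac{K_{n+1}}{d^{\gamma_1}}, \ldots, \frac{K_{n+m}}{d^{\gamma_m}}, \frac{K_{n+1}}{d^{\gamma_1}}, \ldots, \frac{K_{n+m}}{d^{\gamma_m}}, \ldots \Bigr). \tag{3} \]

This helps to identify each component being studied as $d \to \infty$ without referring to a component that would otherwise be at an infinite position.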

Without loss of generality, we assume the first $n$ and the next $m$ scaling terms in (3) to be respectively arranged according to an asymptotic increasing order, in the following sense. If $\preceq$ means "is asymptotically smaller than or equal to," then we have $\theta_1^{-2}(d) \preceq \cdots \preceq \theta_n^{-2}(d)$ and similarly $\theta_{n+1}^{-2}(d) \preceq \cdots \preceq \theta_{n+m}^{-2}(d)$, which respectively implies that $-\infty < \lambda_n \le \lambda_{n-1} \le \cdots \le \lambda_1 < \infty$ and $-\infty < \gamma_m \le \gamma_{m-1} \le \cdots \le \gamma_1 < \infty$. Based on this ordering, the asymptotically smallest scaling term has to be either $\theta_1^{-2}(d)$ or $\theta_{n+1}^{-2}(d)$.

Our goal is to study the limiting distribution of each component forming the d-dimensional Markov process. To this end, we set the scaling term of the target component of interest equal to 1 [$\theta_{i^*}(d) = 1$]. This adjustment, necessary to obtain a nontrivial limiting process, is performed without loss of generality by applying a linear transformation to the target distribution. In particular, when the first component of the chain is studied ($i^* = 1$), we set $\theta_1^{-2}(d) = 1$ and adjust the other scaling terms accordingly. $\Theta^{-2}(d)$ thus varies according to the component of interest $i^*$ considered.
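The rescaling described above amounts to dividing every scaling term by that of the component of interest. A minimal sketch follows; the helper name `rescale_scaling_terms` and the zero-based index `i_star` are assumptions made for illustration only.

```python
def rescale_scaling_terms(theta_inv_sq, i_star):
    """Rescale the vector Theta^{-2}(d) so that component i_star has scaling term 1.

    theta_inv_sq: list of positive scaling terms theta_j^{-2}(d).
    i_star: zero-based position of the component of interest (an assumption of
            this sketch; the paper indexes components from 1).
    """
    ref = theta_inv_sq[i_star]
    if ref <= 0:
        raise ValueError("scaling terms must be positive")
    return [v / ref for v in theta_inv_sq]


if __name__ == "__main__":
    d = 5
    # Scaling vector (d, 1, ..., 1), studied from its first component.
    print(rescale_scaling_terms([float(d)] + [1.0] * (d - 1), 0))
```

For the vector $(d, 1, \ldots, 1)$ and the first component, this produces $(1, 1/d, \ldots, 1/d)$.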

2.3. The proposal distribution and its scaling. A crucial step in the implementation of Metropolis algorithms is to determine the optimal form for the proposal scaling as a function of $d$. Intuitively, it makes sense that $\sigma^2(d)$ depends on the asymptotically smallest scaling term in $\Theta^{-2}(d)$. Otherwise, the proposed moves might be too large for the components with smaller scaling terms, resulting in a high rejection rate and compromising the convergence of the algorithm.

Moreover, as the dimension of the target increases, more individual moves are proposed in a single step and it is thus more likely to generate an improbable move for one of the components. To rectify the situation, it is recommended to decrease the proposal scaling as a function of $d$.

Hence, the optimal form of the proposal scaling turns out to be $\sigma^2(d) = \ell^2/d^\alpha$, where $\ell^2$ is some constant and $\alpha$ is the smallest number satisfying

\[ \lim_{d\to\infty} \frac{d^{\lambda_1}}{d^{\alpha}} < \infty \quad \text{and} \quad \lim_{d\to\infty} \frac{d^{\gamma_i}\, c(\mathcal{J}(i,d))}{d^{\alpha}} < \infty, \quad i = 1, \ldots, m. \tag{4} \]

Therefore, at least one of these $m + 1$ limits converges to some positive constant, while the other ones converge to 0. Since the scaling term of the component studied is taken to be 1, the largest possible form for the proposal scaling is $\sigma^2 \equiv \sigma^2(d) = \ell^2$ and so it never diverges as $d$ grows.
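Condition (4) can be turned into a one-line computation once the growth of each cardinality function is known. The sketch below assumes, for illustration, that $c(\mathcal{J}(i,d))$ grows like $d^{p_i}$ for a known degree $p_i > 0$, so that $\alpha = \max(\lambda_1, \max_i(\gamma_i + p_i))$; the function name `proposal_exponent` is hypothetical.

```python
def proposal_exponent(lambda_1, gammas, degrees):
    """Smallest alpha for which every limit in condition (4) is finite.

    lambda_1: exponent of the asymptotically smallest of the first n scaling
              terms, or None when n = 0.
    gammas:   exponents gamma_1, ..., gamma_m of the m groups.
    degrees:  ASSUMED polynomial degrees p_i, with c(J(i, d)) of order d^{p_i}.
    """
    if not gammas or len(gammas) != len(degrees):
        raise ValueError("need one degree per group and at least one group")
    candidates = [g + p for g, p in zip(gammas, degrees)]
    if lambda_1 is not None:
        candidates.append(lambda_1)
    return max(candidates)


if __name__ == "__main__":
    # Theta^{-2}(d) = (d, 1, ..., 1): lambda_1 = -1, gamma_1 = 0, c of order d.
    print(proposal_exponent(-1.0, [0.0], [1.0]))
    # Theta^{-2}(d) = (1, 1/d, ..., 1/d): lambda_1 = 0, gamma_1 = 1, c of order d.
    print(proposal_exponent(0.0, [1.0], [1.0]))
```

The two calls correspond to proposal scalings $\ell^2/d$ and $\ell^2/d^2$, respectively.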

By its nature, the Metropolis algorithm is a discrete-time process. Since 
space (the proposal scaling) is a function of the dimension of the target 
distribution, we also have to rescale the time between each step in order to 
obtain a nontrivial limiting process as $d \to \infty$.

Let $\mathbf{Z}^{(d)}(t)$ be the time-$t$ value of the process sped up by a factor of $d^\alpha$; in particular, $\mathbf{Z}^{(d)}(t) = (X_1^{(d)}([d^\alpha t]), \ldots, X_d^{(d)}([d^\alpha t]))$, where $[\cdot]$ is the "integer part" function. Instead of proposing only one move, the sped-up process has the possibility of moving on average $d^\alpha$ times during each unit time interval. We are now ready to study the limiting behavior of every component of the sequence of processes $\{\mathbf{Z}^{(d)}(t), t \ge 0\}$ as $d \to \infty$.



3. Optimizing the sampling procedure. 



3.1. Optimal value for $\ell$. We shall now present explicit asymptotic results allowing us to optimize $\ell^2$, the constant term of $\sigma^2(d)$. We first introduce a weak convergence result for the process $\{\mathbf{Z}^{(d)}(t), t \ge 0\}$ and, most importantly in practice, we transform the achieved conclusion into a statement about efficiency as a function of acceptance rate, as was done in [11].

We denote weak convergence in the Skorokhod topology by $\Rightarrow$, standard Brownian motion at time $t$ by $B(t)$ and the standard normal cumulative distribution function (c.d.f.) by $\Phi(\cdot)$. Moreover, recall that the scaling term of the component of interest $X_{i^*}$ is taken to be one [$\theta_{i^*}(d) = 1$], which, as explained in Section 2.2, might require a linear transformation of $\Theta^{-2}(d)$.

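Theorem 1. Consider a Metropolis algorithm with proposal distribution $\mathbf{Y}^{(d)} \sim N(\mathbf{X}^{(d)}, (\ell^2/d^\alpha) I_d)$, where $\alpha$ satisfies (4), applied to a target density satisfying the specified conditions on $f$ as in (1), with $\theta_j^{-2}(d)$, $j = 1, \ldots, d$, as in (3) and $\theta_{i^*}(d) = 1$. Consider the $i^*$th component of the process $\{\mathbf{Z}^{(d)}(t), t \ge 0\}$, that is, $\{Z_{i^*}^{(d)}(t), t \ge 0\} = \{X_{i^*}^{(d)}([d^\alpha t]), t \ge 0\}$ and let $\mathbf{X}^{(d)}(0)$ be distributed according to the target density $\pi$ in (1).

We have $\{Z_{i^*}^{(d)}(t), t \ge 0\} \Rightarrow \{Z(t), t \ge 0\}$, where $Z(0)$ is distributed according to the density $f$ and $\{Z(t), t \ge 0\}$ satisfies the Langevin stochastic differential equation (SDE)

\[ dZ(t) = \upsilon(\ell)^{1/2}\, dB(t) + \tfrac{1}{2} \upsilon(\ell) (\log f(Z(t)))'\, dt, \]

if and only if

\[ \lim_{d\to\infty} \frac{d^{\lambda_1}}{\sum_{j=1}^{n} d^{\lambda_j} + \sum_{i=1}^{m} c(\mathcal{J}(i,d))\, d^{\gamma_i}} = 0. \tag{5} \]

Here, $\upsilon(\ell) = 2\ell^2 \Phi(-\ell\sqrt{E_R}/2)$ and

\[ E_R = \lim_{d\to\infty} \sum_{i=1}^{m} \frac{c(\mathcal{J}(i,d))\, d^{\gamma_i}}{d^{\alpha}} \frac{1}{K_{n+i}} E\Bigl[ \Bigl( \frac{f'(X)}{f(X)} \Bigr)^2 \Bigr], \tag{6} \]

with $c(\mathcal{J}(i,d))$ as in (2).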

Intuitively, when none of the target components possesses a scaling term significantly smaller than those of the other components, the limiting process is the same as that found in [11]. Note that the numerator in condition (5) is based on $\theta_1^{-2}(d)$ only, which is not necessarily the asymptotically smallest scaling term. Technically, we should then also verify that this condition is still satisfied when $\theta_1^{-2}(d)$ is replaced by $\theta_{n+1}^{-2}(d)$; however, this is ensured by the presence of the term $c(\mathcal{J}(1,d))\,\theta_{n+1}^{2}(d)$ in the denominator.

The function $\upsilon(\ell)$ is sometimes interpreted as the speed measure of the diffusion process. As this quantity is proportional to the mixing rate of the algorithm, it suffices to maximize the function $\upsilon(\ell)$ in order to optimize the efficiency of the algorithm.

Let $a(d, \ell)$ be the $\pi$-average acceptance rate defined in Section 2.1, but where the dependence on the dimension and the proposal scaling are now made explicit. The following corollary introduces the optimal value $\hat{\ell}$ and AOAR leading to greatest efficiency of the Metropolis algorithm.

Corollary 2. In the setting of Theorem 1, we have $\lim_{d\to\infty} a(d, \ell) = 2\Phi(-\ell\sqrt{E_R}/2) \equiv a(\ell)$. Furthermore, $\upsilon(\ell)$ is maximized at the unique value $\hat{\ell} = 2.38/\sqrt{E_R}$ for which $a(\hat{\ell}) = 0.234$ (to three decimal places).

For a high-dimensional target distribution as defined in Section 2.2 and having no component converging significantly faster than the others, the value $\ell$ should be chosen such that the acceptance rate is close to 0.234 in order to optimize the efficiency of the Metropolis algorithm.
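The values 2.38 and 0.234 can be recovered numerically from the speed measure $2\ell^2\Phi(-\ell\sqrt{E_R}/2)$. The following sketch maximizes it by golden-section search, relying on the uniqueness of the maximizer stated in Corollary 2; the function names and the search interval are assumptions made for illustration.

```python
import math


def std_normal_cdf(x):
    """Standard normal c.d.f., written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def acceptance(ell, e_r=1.0):
    """Limiting acceptance rate a(ell) = 2 * Phi(-ell * sqrt(E_R) / 2)."""
    return 2.0 * std_normal_cdf(-ell * math.sqrt(e_r) / 2.0)


def speed(ell, e_r=1.0):
    """Speed measure: ell^2 times the limiting acceptance rate."""
    return ell ** 2 * acceptance(ell, e_r)


def maximize_speed(e_r=1.0, tol=1e-10):
    """Golden-section search for the maximizer of the speed measure.

    The bracket (0, 10 / sqrt(E_R)) is an ASSUMED search interval.
    """
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = 1e-6, 10.0 / math.sqrt(e_r)
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if speed(c, e_r) < speed(d, e_r):
            a = c  # the maximum lies to the right of c
        else:
            b = d  # the maximum lies to the left of d
    return 0.5 * (a + b)


if __name__ == "__main__":
    ell_hat = maximize_speed(1.0)
    print(f"ell_hat = {ell_hat:.4f}, acceptance = {acceptance(ell_hat):.4f}")
```

Changing `e_r` rescales the optimal $\ell$ by $1/\sqrt{E_R}$ but leaves the acceptance rate at the optimum unchanged, which is what makes the 0.234 rule convenient in practice.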

Theorem 1 may be used to determine whether or not the AOAR for sampling from any multivariate normal distribution with covariance matrix $\Sigma$ is 0.234. Since normal random variables are invariant under orthogonal transformations, we can transform $\Sigma$ into a diagonal matrix where the eigenvalues of $\Sigma$ constitute the diagonal elements. The eigenvalues can then be used to determine whether or not condition (5) is satisfied and hence to determine whether or not $2.38/\sqrt{E_R}$ is the optimal scaling for the proposal distribution. For example, consider $\Sigma$ with $\sigma_i^2 = 2$, $i = 1, \ldots, d$, and $\sigma_{ij} = 1$, $j \ne i$. The $d$ eigenvalues of $\Sigma$ are $(d + 1, 1, \ldots, 1)$, which are of the same order as $(d, 1, \ldots, 1)$ and satisfy condition (5). For a relatively high-dimensional multivariate normal with such a correlation structure, it is thus optimal to tune the acceptance rate to 0.234. Note, however, that not all $d$ components mix at the same rate. When studying any of the last $d - 1$ components, the vector $\Theta^{-2}(d) = (d, 1, \ldots, 1)$ is appropriate, so $\sigma^2(d) = \ell^2/d$ and these components thus mix in $O(d)$ iterations. When studying the first component, we need to linearly transform the scaling vector so that $\theta_1^{-2}(d) = 1$. We then use $\Theta^{-2}(d) = (1, 1/d, \ldots, 1/d)$, so $\sigma^2(d) = \ell^2/d^2$ and this component mixes according to $O(d^2)$.

Now, consider the simple model where $X_1 \sim N(0, 1)$ and $X_j \sim N(X_1, 1)$ for $j = 2, \ldots, d$. The joint distribution of $\mathbf{X}^{(d)}$ is multivariate normal with mean 0 and $d \times d$ covariance matrix such that $\sigma_1^2 = 1$, $\sigma_2^2 = \cdots = \sigma_d^2 = 2$ and $\sigma_{jk} = 1$ for all $j \ne k$. Using the $d$ eigenvalues, which are $O(d)$, $O(1/d)$ and 1 with multiplicity $d - 2$, we thus conclude that condition (5) is violated and that 0.234 might not be optimal, even though the distribution is normal [see Theorem 5 of Section 3.2 when dealing with more general $\theta_j(d)$'s].
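The eigenvalue claims in the two normal examples can be checked directly. The sketch below builds both covariance matrices, verifies candidate eigenpairs by matrix-vector multiplication and evaluates a finite-$d$ proxy for condition (5), namely the largest $\theta_j^2$ divided by the sum of all $\theta_j^2$, with $\theta_j^{-2}$ taken to be the eigenvalues. The closed forms used (largest eigenvalue $d + 1$ in the first example; the two non-unit eigenvalues solving $\lambda^2 - (d+1)\lambda + 1 = 0$ in the second) and all function names are assumptions of this illustration, confirmed numerically by the eigenpair check rather than taken from the paper.

```python
import math


def matvec(a, v):
    """Plain-Python matrix-vector product."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def is_eigenpair(a, v, lam, tol=1e-9):
    """Check a v = lam v up to a relative tolerance."""
    scale = 1.0 + abs(lam) * max(abs(x) for x in v)
    return all(abs(x - lam * y) <= tol * scale for x, y in zip(matvec(a, v), v))


def exchangeable_cov(d):
    """First example: variances 2, covariances 1."""
    return [[2.0 if i == j else 1.0 for j in range(d)] for i in range(d)]


def hierarchical_cov(d):
    """Second example: X_1 ~ N(0, 1) and X_j | X_1 ~ N(X_1, 1)."""
    cov = [[1.0] * d for _ in range(d)]
    for j in range(1, d):
        cov[j][j] = 2.0
    return cov


def exchangeable_eigenvalues(d):
    return [d + 1.0] + [1.0] * (d - 1)


def hierarchical_eigenvalues(d):
    # The two non-unit eigenvalues have product 1 and sum d + 1.
    big = (d + 1.0 + math.sqrt((d + 1.0) ** 2 - 4.0)) / 2.0
    return [big, 1.0 / big] + [1.0] * (d - 2)


def condition_5_ratio(eigenvalues):
    """Finite-d proxy for (5): max theta_j^2 over the sum of all theta_j^2."""
    theta_sq = [1.0 / e for e in eigenvalues]
    return max(theta_sq) / sum(theta_sq)


if __name__ == "__main__":
    for d in (10, 100, 1000):
        r1 = condition_5_ratio(exchangeable_eigenvalues(d))
        r2 = condition_5_ratio(hierarchical_eigenvalues(d))
        print(f"d={d}: first example {r1:.4f}, second example {r2:.4f}")
```

In the first example the ratio shrinks like $1/d$, whereas in the second it stays near $1/2$, in line with condition (5) holding in one case and failing in the other.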

The previous example might seem surprising, as multivariate normal distributions have long been believed to behave as i.i.d. target distributions in the limit. A natural question to ask, then, is what happens when condition (5) is not satisfied? In such a case, the algorithm can be shown to admit the same limiting Langevin diffusion process, but with a different speed measure. Furthermore, the AOAR is found to be smaller than the usual 0.234. For more details on this case, see [1]. For a better picture of the applicability of these results, examples and simulation studies for various statistical models are presented in [2].

3.2. Inhomogeneous proposal scaling and extensions. Thus far, we have assumed $\sigma^2(d) = \ell^2/d^\alpha$ to be the same for all $d$ components. It is natural to wonder whether adjusting the proposal scaling as a function of $d$ for each component would yield a better performance of the algorithm. An important point to keep in mind is that for $\{\mathbf{Z}^{(d)}(t), t \ge 0\}$ to be a stochastic process, we must speed up time by the same factor for every component. Otherwise, some components would move more frequently than others in the same time interval and since the acceptance probability of the proposed moves depends on all $d$ components, this would violate the definition of a stochastic process.

The inhomogeneous scheme we adopt is the following: we personalize the proposal scaling of the last $d - n$ components only, implying that the proposal scaling of the first $n$ components is the same as it would have been under the homogeneity assumption. We then treat each of the $m$ groups of scaling terms appearing infinitely often as a different portion of the scaling vector and determine the appropriate $\alpha$ for each group.

In particular, consider $\Theta^{-2}(d)$ in (3) and let the proposal scaling of $X_j$ be $\sigma_j^2(d) = \ell^2/d^{\alpha_j}$, where $\alpha_j = \alpha$ for $j = 1, \ldots, n$ and $\alpha_j$ is the smallest value such that $\lim_{d\to\infty} c(\mathcal{J}(i,d))\, d^{\gamma_i}/d^{\alpha_j} < \infty$ for $j = n+1, \ldots, d$, $j \in \mathcal{J}(i,d)$. In order to study the component $X_{i^*}$, we still assume that $\theta_{i^*}(d) = 1$, but we now let $\mathbf{Z}^{(d)}(t) = \mathbf{X}^{(d)}([d^{\alpha_{i^*}} t])$. We have the following result.

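Theorem 3. In the setting of Theorem 1, but with the proposal scaling as just described, the conclusions of Theorem 1 and Corollary 2 are preserved and $E_R$ is now expressed as

\[ E_R = \lim_{d\to\infty} \sum_{i=1}^{m} \frac{c(\mathcal{J}(i,d))\, d^{\gamma_i}}{d^{\alpha_{n+i}}} \frac{1}{K_{n+i}} E\Bigl[ \Bigl( \frac{f'(X)}{f(X)} \Bigr)^2 \Bigr]. \]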

Since the proposal scaling is now adjusted to suit every distinct group of components, each constant term $K_{n+1}, \ldots, K_{n+m}$ has an impact on the limiting process, yielding a larger value for $E_R$. Hence, the optimal value $\hat{\ell} = 2.38/\sqrt{E_R}$ is now smaller than with homogeneous proposal scaling. When the proposal scaling of all components was based on $\alpha$ in Section 3.1, the algorithm had to compensate for the fact that $\alpha$ is chosen as small as possible, and thus possibly too small for certain groups of components, with a larger value for $\hat{\ell}^2$.
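The effect of personalizing the proposal scaling can be illustrated with a small computation. The sketch assumes that each cardinality function behaves like $\text{coef}_i \cdot d^{p_i}$, that $E[(f'(X)/f(X))^2] = 1$ (as for a standard normal $f$), and that under inhomogeneous scaling each group is normalized by its own exponent $\gamma_i + p_i$; the tuple layout and function names are hypothetical.

```python
import math


def e_r_homogeneous(groups, alpha, fisher=1.0):
    """E_R with one exponent alpha for all components.

    groups: list of (gamma, coef, degree, K) tuples, with c(J(i, d)) ASSUMED
            to behave like coef * d**degree.
    Only groups with gamma + degree == alpha contribute in the limit.
    """
    total = 0.0
    for gamma, coef, degree, k in groups:
        growth = gamma + degree
        if growth > alpha + 1e-12:
            raise ValueError("alpha is too small: a limit in (4) diverges")
        if math.isclose(growth, alpha, abs_tol=1e-12):
            total += coef / k * fisher
    return total


def e_r_inhomogeneous(groups, fisher=1.0):
    """E_R when every group is normalized by its own exponent."""
    return sum(coef / k * fisher for _gamma, coef, _degree, k in groups)


def optimal_ell(e_r):
    """Optimal constant 2.38 / sqrt(E_R) from Corollary 2."""
    if e_r <= 0:
        raise ValueError("E_R must be positive")
    return 2.38 / math.sqrt(e_r)


if __name__ == "__main__":
    groups = [(0.0, 0.5, 1.0, 1.0), (0.5, 0.5, 1.0, 2.0)]
    hom = e_r_homogeneous(groups, alpha=1.5)
    inh = e_r_inhomogeneous(groups)
    print(f"homogeneous:   E_R = {hom:.3f}, ell_hat = {optimal_ell(hom):.3f}")
    print(f"inhomogeneous: E_R = {inh:.3f}, ell_hat = {optimal_ell(inh):.3f}")
```

In this toy configuration only the second group contributes under homogeneous scaling, so $E_R$ is smaller and the optimal constant larger than under inhomogeneous scaling.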

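The conclusions of Section 3.1 also extend to more general target distribution settings. First, we can relax the assumption of equality among the constant terms of $\theta_j^{-2}(d)$ for $j \in \mathcal{J}(i,d)$. In particular, let

\[ \theta_j^{-2}(d) = \frac{K_j}{d^{\lambda_j}}, \quad j = 1, \ldots, n, \qquad \theta_j^{-2}(d) = \frac{K_j}{d^{\gamma_i}}, \quad j \in \mathcal{J}(i,d),\ i = 1, \ldots, m. \tag{7} \]

We assume that $\{K_j, j \in \mathcal{J}(i,d)\}$ are i.i.d. and chosen randomly from some distribution with $E[K_j^{-2}] < \infty$. Without loss of generality, we denote $E[K_j^{-1}] = b_i$ for $j \in \mathcal{J}(i,d)$. Recall that the scaling term of the component of interest cannot depend on $d$, so we have $\theta_{i^*}^{-2}(d) = K_{i^*}$.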

To support the previous modifications, we now suppose that $-\infty < \gamma_m < \gamma_{m-1} < \cdots < \gamma_1 < \infty$. In addition, we assume that there does not exist a $\lambda_j$, $j = 1, \ldots, n$, equal to one of the $\gamma_i$, $i = 1, \ldots, m$. This means that if there is an infinite number of scaling terms of the same order, they must necessarily belong to the same one of the $m$ groups. We obtain the following result.

Theorem 4. Consider the setting of Theorem 1, except with $\Theta^{-2}(d)$ as in (7) and $\theta_{i^*} = K_{i^*}^{-1/2}$. We have $\{Z_{i^*}^{(d)}(t), t \ge 0\} \Rightarrow \{Z(t), t \ge 0\}$, where $Z(0)$ is distributed according to the density $\theta_{i^*} f(\theta_{i^*} x)$ and $\{Z(t), t \ge 0\}$ satisfies the Langevin SDE

\[ dZ(t) = (\upsilon(\ell))^{1/2}\, dB(t) + \tfrac{1}{2} \upsilon(\ell) (\log f(\theta_{i^*} Z(t)))'\, dt, \]

if and only if condition (5) is satisfied. Here, $\upsilon(\ell)$ is as in Theorem 1 and

\[ E_R = \lim_{d\to\infty} \sum_{i=1}^{m} \frac{c(\mathcal{J}(i,d))\, d^{\gamma_i}}{d^{\alpha}}\, b_i\, E\Bigl[ \Bigl( \frac{f'(X)}{f(X)} \Bigr)^2 \Bigr], \]

with $c(\mathcal{J}(i,d)) = \#\{ j \in \{n+1, \ldots, d\};\ \theta_j(d) \text{ is } O(d^{\gamma_i/2}) \}$. Furthermore, the conclusions of Corollary 2 are preserved.

The previous results can also be extended to more general functions $c(\mathcal{J}(i,d))$, $i = 1, \ldots, m$, and $\theta_j(d)$, $j = 1, \ldots, d$. In order to have sensible limiting theory, however, we restrict our attention to functions for which the limit exists as $d \to \infty$. As before, we must have $c(\mathcal{J}(i,d)) \to \infty$ as $d \to \infty$. We even allow $\{\theta_j^{-2}(d), j \in \mathcal{J}(i,d)\}$ to vary within each of the $m$ groups, provided they are of the same order. That is, for $j \in \mathcal{J}(i,d)$, we suppose that $\lim_{d\to\infty} \theta_j(d)/\theta'_i(d) = K_j^{-1/2}$ for some reference function $\theta'_i(d)$ and some constant $K_j$ coming from the distribution described in Theorem 4.

As for Theorem 4, we assume that if there are infinitely many scaling terms of a certain order, then they must all belong to one of the $m$ groups. Hence, $\Theta^{-2}(d)$ contains at least $m$ and at most $n + m$ functions of different orders. The positions of the elements belonging to the $i$th group are thus

\[ \mathcal{J}(i,d) = \Bigl\{ j \in \{1, \ldots, d\};\ 0 < \lim_{d\to\infty} \frac{\theta_j(d)}{\theta'_i(d)} < \infty \Bigr\}, \quad i \in \{1, \ldots, m\}. \tag{8} \]

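For such target distributions, we define the proposal scaling to be $\sigma^2(d) = \ell^2 \sigma_\alpha^2(d)$, with $\sigma_\alpha^2(d)$ the function of largest possible order such that

\[ \lim_{d\to\infty} \theta_1^2(d)\, \sigma_\alpha^2(d) < \infty \quad \text{and} \quad \lim_{d\to\infty} c(\mathcal{J}(i,d))\, \theta_i'^2(d)\, \sigma_\alpha^2(d) < \infty, \quad i = 1, \ldots, m. \tag{9} \]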

Theorem 5. Under the setting of Theorem 4, but with proposal scaling $\sigma^2(d) = \ell^2 \sigma_\alpha^2(d)$, where $\sigma_\alpha^2(d)$ satisfies (9) and with general functions for $c(\mathcal{J}(i,d))$ and $\theta_j(d)$ as defined previously, the conclusions of Theorem 4 are preserved, provided that

\[ \lim_{d\to\infty} \frac{\theta_1^2(d)}{\sum_{j=1}^{n} \theta_j^2(d) + \sum_{i=1}^{m} c(\mathcal{J}(i,d))\, \theta_i'^2(d)} = 0 \]

holds instead of condition (5) and with

\[ E_R = \lim_{d\to\infty} \sum_{i=1}^{m} c(\mathcal{J}(i,d))\, \theta_i'^2(d)\, \sigma_\alpha^2(d)\, b_i\, E\Bigl[ \Bigl( \frac{f'(X)}{f(X)} \Bigr)^2 \Bigr], \]

where $c(\mathcal{J}(i,d))$ is the cardinality function of (8).

This theorem assumes quite a general form for the scaling terms of the target distribution and allows for a lot of flexibility.

4. Proofs of theorems. We now present the proof of Theorem 1; those of the theorems in Section 3.2 being similar, we just outline the main differences. The proofs are based on Theorem 8.2 of Chapter 4 in [7] which roughly says that for the finite-dimensional distributions of a sequence of processes to converge weakly to those of some Markov process, it is sufficient to verify $L^1$ convergence of their generators. Then, Corollary 8.6 of the same chapter provides further conditions for our sequence of processes to be relatively compact and thus to reach weak convergence of the stochastic processes themselves. Specifically, it is easily verified that $C_c^\infty$, the space of infinitely differentiable functions with compact support, is an algebra that strongly separates points. Since the algorithm starts in stationarity, $\mathbf{X}^{(d)}(t) \sim \pi$ for all $t \ge 0$. Using a method similar to the proof of Lemma 7, we show that $E[(Gh(d, \mathbf{X}^{(d)}))^2]$ is bounded by some constant for all $d \ge 1$, where $G$ is the generator of the sped-up Metropolis algorithm appearing in Section 4.2; this ensures relative compactness.

Our task is then to focus on the $L^1$ convergence of the generators. To this end, we base our approach on the proof for the Metropolis algorithm case in [10]. Note, however, that the authors instead prove uniform convergence of generators, and this could not be used in the present situation.

The generator is written in terms of an arbitrary test function $h$ which can usually be any smooth function; in our case, we restrict our attention to functions in $C_c^\infty$. Since the limiting process obtained is a diffusion, it follows that $C_c^\infty$ is a core for the generator by Theorem 2.1 of Chapter 8 in [7], so instead of verifying $L^1$ convergence of the generators for all functions $h$ in the domain of $G_L$, we shall be allowed to work with functions belonging to this core only.

In order to ease notation, we adopt the following convention for defining vectors: $\mathbf{X}^{(b-a)} = (X_{a+1}, \ldots, X_b)$. The minus sign appearing outside the brackets (e.g., $\mathbf{X}^{(b-a)-}$) means that the component of interest, $X_{i^*}$, is excluded. We also use the following notation for conditional expectations: $E[f(\mathbf{X}, \mathbf{Y}) | \mathbf{X}] = E_{\mathbf{Y}}[f(\mathbf{X}, \mathbf{Y})]$. When there is no subscript, the expectation is taken with respect to all random variables included in the expression.

4.1. Restrictions on the proposal scaling. We first transform condition (5) into a statement about the proposal scaling and its parameter $\alpha$. For this condition to be satisfied, we must equivalently have

\[ \lim_{d\to\infty} \frac{K_1}{d^{\lambda_1}} \Bigl( \frac{d^{\lambda_2}}{K_2} + \cdots + \frac{d^{\lambda_n}}{K_n} \Bigr) + \lim_{d\to\infty} \frac{K_1}{d^{\lambda_1}} \Bigl( c(\mathcal{J}(1,d)) \frac{d^{\gamma_1}}{K_{n+1}} + \cdots + c(\mathcal{J}(m,d)) \frac{d^{\gamma_m}}{K_{n+m}} \Bigr) = \infty. \]

Since the first term on the left-hand side is finite, there is at least one $i \in \{1, \ldots, m\}$ such that $\lim_{d\to\infty} \theta_1^{-2}(d)\, c(\mathcal{J}(i,d))\, \theta_{n+i}^{2}(d) = \infty$. Consequently, the choice of $\alpha$ in (4) must be based on one of the groups of scaling terms appearing infinitely often. If we had $\alpha = \lambda_1$, this would mean that $\lim_{d\to\infty} c(\mathcal{J}(i,d))\, d^{\gamma_i}/d^{\alpha} = \infty$ for all $i$ for which the previous limit was diverging, which contradicts the definition of $\alpha$. When condition (5) is satisfied, it thus follows that $\lim_{d\to\infty} d^{\lambda_1}/d^{\alpha} = 0$ and $\theta_1^{-2}(d)$ does not govern $\alpha$; the parameter $\alpha$ is then strictly greater than 0, regardless of which component is under consideration.

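4.2. Proof of Theorem 1. For an arbitrary test function $h \in C_c^\infty$, we show that

\[ \lim_{d\to\infty} E\bigl[ |Gh(d, \mathbf{X}^{(d)}) - G_L h(X_{i^*})| \bigr] = 0, \]

where

\[ Gh(d, \mathbf{X}^{(d)}) = d^{\alpha} E_{\mathbf{Y}^{(d)}}\Bigl[ (h(Y_{i^*}) - h(X_{i^*})) \Bigl( 1 \wedge \frac{\pi(d, \mathbf{Y}^{(d)})}{\pi(d, \mathbf{X}^{(d)})} \Bigr) \Bigr] \]

is the discrete-time generator of the sped-up Metropolis algorithm and $G_L h(X_{i^*}) = \upsilon(\ell) [ \tfrac{1}{2} h''(X_{i^*}) + \tfrac{1}{2} h'(X_{i^*}) (\log f(X_{i^*}))' ]$ is the generator of a Langevin diffusion process with speed measure $\upsilon(\ell)$, as in Theorem 1.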

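According to Lemma 7, we have $\lim_{d\to\infty} E[|Gh(d, \mathbf{X}^{(d)}) - \tilde{G}h(d, \mathbf{X}^{(d)})|] = 0$, where

\[ \begin{aligned} \tilde{G}h(d, \mathbf{X}^{(d)}) = {} & \tfrac{1}{2} \ell^2 h''(X_{i^*})\, E_{\mathbf{Y}^{(d)-}}\Bigl[ 1 \wedge \exp\Bigl\{ \sum_{j=1, j \ne i^*}^{d} \varepsilon(d, X_j, Y_j) \Bigr\} \Bigr] \\ & + \ell^2 h'(X_{i^*}) (\log f(X_{i^*}))' \\ & \quad \times E_{\mathbf{Y}^{(d)-}}\Bigl[ \exp\Bigl\{ \sum_{j=1, j \ne i^*}^{d} \varepsilon(d, X_j, Y_j) \Bigr\};\ \sum_{j=1, j \ne i^*}^{d} \varepsilon(d, X_j, Y_j) < 0 \Bigr] \end{aligned} \]

and $\varepsilon(d, X_j, Y_j)$ is as in (10). To prove the theorem, we are thus left to show $L^1$ convergence of the generator $\tilde{G}h(d, \mathbf{X}^{(d)})$ to the generator of the Langevin diffusion.

Substituting explicit expressions for the generators, grouping some terms and using the triangle inequality, we obtain

\[ \begin{aligned} E\bigl[ |\tilde{G}h(d, \mathbf{X}^{(d)}) - G_L h(X_{i^*})| \bigr] \le {} & \ell^2 E_{\mathbf{X}^{(d)-}}\Bigl[ \Bigl| \tfrac{1}{2} E_{\mathbf{Y}^{(d)-}}\Bigl[ 1 \wedge \exp\Bigl\{ \sum_{j=1, j \ne i^*}^{d} \varepsilon(d, X_j, Y_j) \Bigr\} \Bigr] - \Phi\Bigl( -\frac{\ell \sqrt{E_R}}{2} \Bigr) \Bigr| \Bigr] \\ & \quad \times E[|h''(X_{i^*})|] \\ & + \ell^2 E_{\mathbf{X}^{(d)-}}\Bigl[ \Bigl| E_{\mathbf{Y}^{(d)-}}\Bigl[ \exp\Bigl\{ \sum_{j=1, j \ne i^*}^{d} \varepsilon(d, X_j, Y_j) \Bigr\};\ \sum_{j=1, j \ne i^*}^{d} \varepsilon(d, X_j, Y_j) < 0 \Bigr] - \Phi\Bigl( -\frac{\ell \sqrt{E_R}}{2} \Bigr) \Bigr| \Bigr] \\ & \quad \times E[|h'(X_{i^*}) (\log f(X_{i^*}))'|]. \end{aligned} \]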



Since the function $h$ has compact support, it follows that $h$ itself and its derivatives are bounded in absolute value by some constant. As a result, $E[|h''(X_{i^*})|]$ and $E[|h'(X_{i^*}) (\log f(X_{i^*}))'|]$ are both bounded by $K$, say. Using Lemmas 8 and 9, we then conclude that the first expectation on the right-hand side goes to 0 as $d \to \infty$; we reach the same conclusion for the first expectation of the second term by applying Lemmas 10 and 11.



4.3. Proof of Theorem 4. The main difference with the proof of Theorem 1 occurs when working with the $m$ groups formed of infinitely many components. Since the constant terms are now random, we cannot factorize the scaling terms of components belonging to the same group. However, this difficulty is easily overcome by changes of variable and the use of conditional expectations; for instance, a typical quantity we must work with is

\[ \frac{c(\mathcal{J}(i,d))\, d^{\gamma_i}}{d^{\alpha}} \Biggl[ \frac{1}{c(\mathcal{J}(i,d))} \sum_{j \in \mathcal{J}(i,d)} \frac{1}{K_j} \Bigl( \frac{d}{dX_j} \log f(X_j) \Bigr)^2 \Biggr]. \]

By the weak law of large numbers (WLLN; see, e.g., [15]), the term in brackets converges to $b_i E[(f'(X)/f(X))^2]$. Instead of carrying the term $\theta_{n+i}^2(d) = d^{\gamma_i}/K_{n+i}$ as before, we thus carry $b_i d^{\gamma_i}$.




4.4. Proof of Theorem 5. The general forms of the functions $c(\mathcal{J}(i,d))$, $i = 1, \ldots, m$, and $\theta_j(d)$, $j = 1, \ldots, d$, necessitate a more elaborate notation, but do not affect the body of the proof. Instead, what alters the demonstration is the fact that $\theta_j(d)$ for $j \in \mathcal{J}(i,d)$ are allowed to be different functions of $d$ provided they are of the same order. Because of this particularity, we must write $\theta_j(d) = K_j^{-1/2}\, \theta'_i(d)\, \theta^*_j(d)/\theta'_i(d)$, where $\theta^*_j(d)$ is implicitly defined. We can then continue with the proof as usual, factoring the term $b_i \theta_i'^2(d)$ instead of $\theta_{n+i}^2(d)$ in Theorem 1 (or $b_i d^{\gamma_i}$ in Theorem 4). Since $\lim_{d\to\infty} \theta^*_j(d)/\theta'_i(d) = 1$, the rest of the proof can be repeated with minor modifications.



5. Equivalent generator and other results. 

5.1. Convergence of an approximation term. 
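Lemma 6. For $i = 1, \ldots, m$, let

\[ W_i(d, \mathbf{X}^{(d)}_{\mathcal{J}(i,d)}, \mathbf{Y}^{(d)}_{\mathcal{J}(i,d)}) = \frac{1}{2} \sum_{j \in \mathcal{J}(i,d)} \Bigl( \frac{d^2}{dX_j^2} \log f(\theta_j(d) X_j) \Bigr) (Y_j - X_j)^2 + \frac{\ell^2}{2 d^{\alpha}} \sum_{j \in \mathcal{J}(i,d)} \Bigl( \frac{d}{dX_j} \log f(\theta_j(d) X_j) \Bigr)^2, \]

where $Y_j | X_j \sim N(X_j, \ell^2/d^{\alpha})$ and $X_j$ is distributed according to the density $\theta_j(d) f(\theta_j(d) x_j)$, independently for all $j = 1, \ldots, d$. Then, for $i = 1, \ldots, m$,

\[ E\bigl[ |W_i(d, \mathbf{X}^{(d)}_{\mathcal{J}(i,d)}, \mathbf{Y}^{(d)}_{\mathcal{J}(i,d)})| \bigr] \to 0 \quad \text{as } d \to \infty. \]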

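Proof. By Jensen's inequality, $E[|W_i|] \le \sqrt{E[W_i^2]}$. Developing the square and taking the expectation conditional on $\mathbf{X}^{(d)}_{\mathcal{J}(i,d)}$, we obtain

\[ \begin{aligned} E_{\mathbf{Y}^{(d)}_{\mathcal{J}(i,d)}}\bigl[ W_i^2(d, \mathbf{X}^{(d)}_{\mathcal{J}(i,d)}, \mathbf{Y}^{(d)}_{\mathcal{J}(i,d)}) \bigr] = {} & \frac{\ell^4}{2 d^{2\alpha}} \sum_{j \in \mathcal{J}(i,d)} \Bigl( \frac{d^2}{dX_j^2} \log f(\theta_j(d) X_j) \Bigr)^2 \\ & + \frac{\ell^4}{4 d^{2\alpha}} \Bigl( \sum_{j \in \mathcal{J}(i,d)} \Bigl[ \frac{d^2}{dX_j^2} \log f(\theta_j(d) X_j) + \Bigl( \frac{d}{dX_j} \log f(\theta_j(d) X_j) \Bigr)^2 \Bigr] \Bigr)^2. \end{aligned} \]

Using changes of variable, we obtain

\[ \begin{aligned} & E_{\mathbf{Y}^{(d)}_{\mathcal{J}(i,d)}}\bigl[ |W_i(d, \mathbf{X}^{(d)}_{\mathcal{J}(i,d)}, \mathbf{Y}^{(d)}_{\mathcal{J}(i,d)})| \bigr] \\ & \quad \le \frac{\ell^2}{\sqrt{2}\, d^{\alpha}}\, \theta_{n+i}^2(d) \sqrt{c(\mathcal{J}(i,d))} \Bigl( \frac{1}{c(\mathcal{J}(i,d))} \sum_{j \in \mathcal{J}(i,d)} \Bigl( \frac{d^2}{dX_j^2} \log f(X_j) \Bigr)^2 \Bigr)^{1/2} \\ & \qquad + \frac{\ell^2}{2 d^{\alpha}}\, \theta_{n+i}^2(d)\, c(\mathcal{J}(i,d)) \Bigl| \frac{1}{c(\mathcal{J}(i,d))} \sum_{j \in \mathcal{J}(i,d)} \Bigl[ \frac{d^2}{dX_j^2} \log f(X_j) + \Bigl( \frac{d}{dX_j} \log f(X_j) \Bigr)^2 \Bigr] \Bigr|. \end{aligned} \]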

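By the WLLN, the term in parentheses on the second line converges in probability to $E[(\frac{d^2}{dX^2} \log f(X))^2]$ as $d \to \infty$. Since $d^{\alpha} > d^{\gamma_i} \sqrt{c(\mathcal{J}(i,d))}$ and the previous expectation is bounded by some constant, the first term converges to 0 as $d \to \infty$. Given that $\theta_{n+i}^2(d)\, c(\mathcal{J}(i,d))/d^{\alpha}$ is $O(1)$ for at least one $i \in \{1, \ldots, m\}$, we must also show that the term between absolute values converges to 0. From Lemma A.1, we know that $f'(x) \to 0$ as $x \to \pm\infty$; hence, we have $E[\frac{d^2}{dX_j^2} \log f(X_j) + (\frac{d}{dX_j} \log f(X_j))^2] = \int f''(x)\,dx = 0$ and as $d \to \infty$, we conclude (by the WLLN) that

\[ \Bigl| \frac{1}{c(\mathcal{J}(i,d))} \sum_{j \in \mathcal{J}(i,d)} \Bigl[ \frac{d^2}{dX_j^2} \log f(X_j) + \Bigl( \frac{d}{dX_j} \log f(X_j) \Bigr)^2 \Bigr] \Bigr| \to_p 0. \qquad \square \]

5.2. Convergence to the equivalent generator $\tilde{G}h(d, \mathbf{X}^{(d)})$.

Lemma 7. For any function $h \in C_c^\infty$, let

\[ \begin{aligned} \tilde{G}h(d, \mathbf{X}^{(d)}) = {} & \tfrac{1}{2} \ell^2 h''(X_{i^*})\, E_{\mathbf{Y}^{(d)-}}\Bigl[ 1 \wedge \exp\Bigl\{ \sum_{j=1, j \ne i^*}^{d} \varepsilon(d, X_j, Y_j) \Bigr\} \Bigr] \\ & + \ell^2 h'(X_{i^*}) (\log f(X_{i^*}))' \\ & \quad \times E_{\mathbf{Y}^{(d)-}}\Bigl[ \exp\Bigl\{ \sum_{j=1, j \ne i^*}^{d} \varepsilon(d, X_j, Y_j) \Bigr\};\ \sum_{j=1, j \ne i^*}^{d} \varepsilon(d, X_j, Y_j) < 0 \Bigr], \end{aligned} \]

where

\[ \varepsilon(d, X_j, Y_j) = \log \frac{f(\theta_j(d) Y_j)}{f(\theta_j(d) X_j)}. \tag{10} \]

If $\alpha > 0$ is as defined in (4), then $\lim_{d\to\infty} E[|Gh(d, \mathbf{X}^{(d)}) - \tilde{G}h(d, \mathbf{X}^{(d)})|] = 0$.

Proof. The proof being similar to that of Lemmas A.2 and A.3 in [10], we shall omit some details. The generator of the sped-up Metropolis algorithm can be expressed as

\[ Gh(d, \mathbf{X}^{(d)}) = d^{\alpha} E_{Y_{i^*}}\Bigl[ (h(Y_{i^*}) - h(X_{i^*}))\, E_{\mathbf{Y}^{(d)-}}\Bigl[ 1 \wedge \exp\Bigl\{ \sum_{j=1}^{d} \varepsilon(d, X_j, Y_j) \Bigr\} \Bigr] \Bigr]. \]

We can reexpress the inner expectation using a Taylor expansion of the minimum function with respect to $Y_{i^*}$ around $X_{i^*}$. As mentioned in [10], the generator becomes

\[ \begin{aligned} Gh(d, \mathbf{X}^{(d)}) = {} & d^{\alpha} E_{Y_{i^*}}[h(Y_{i^*}) - h(X_{i^*})]\, E_{\mathbf{Y}^{(d)-}}\Bigl[ 1 \wedge \exp\Bigl\{ \sum_{j=1, j \ne i^*}^{d} \varepsilon(d, X_j, Y_j) \Bigr\} \Bigr] \\ & + d^{\alpha} (\log f(X_{i^*}))'\, E_{Y_{i^*}}[(h(Y_{i^*}) - h(X_{i^*}))(Y_{i^*} - X_{i^*})] \\ & \quad \times E_{\mathbf{Y}^{(d)-}}\Bigl[ \exp\Bigl\{ \sum_{j=1, j \ne i^*}^{d} \varepsilon(d, X_j, Y_j) \Bigr\};\ \sum_{j=1, j \ne i^*}^{d} \varepsilon(d, X_j, Y_j) < 0 \Bigr] \\ & + \frac{d^{\alpha}}{2} E_{Y_{i^*}}\Bigl[ (h(Y_{i^*}) - h(X_{i^*}))(Y_{i^*} - X_{i^*})^2 ((\log f(U_{i^*}))')^2 \\ & \quad \times E_{\mathbf{Y}^{(d)-}}\bigl[ e^{g(U_{i^*})};\ g(U_{i^*}) < 0 \bigr] \Bigr] \\ & + \frac{d^{\alpha}}{2} E_{Y_{i^*}}\Bigl[ (h(Y_{i^*}) - h(X_{i^*}))(Y_{i^*} - X_{i^*})^2 (\log f(U_{i^*}))'' \\ & \quad \times E_{\mathbf{Y}^{(d)-}}\bigl[ e^{g(U_{i^*})};\ g(U_{i^*}) < 0 \bigr] \Bigr], \end{aligned} \]

where $g(U_{i^*}) = \varepsilon(X_{i^*}, U_{i^*}) + \sum_{j=1, j \ne i^*}^{d} \varepsilon(d, X_j, Y_j)$ for some $U_{i^*} \in (X_{i^*}, Y_{i^*})$ or $(Y_{i^*}, X_{i^*})$.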

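We first note that all expectations computed with respect to $\mathbf{Y}^{(d)-}$ are bounded by 1, $|(\log f(U_{i^*}))''|$ is bounded by a constant and $|(\log f(U_{i^*}))'| \le |(\log f(X_{i^*}))'| + K|Y_{i^*} - X_{i^*}|$ for some $K > 0$. Expressing $h(Y_{i^*}) - h(X_{i^*})$ as a three-term Taylor expansion and using the fact that $h$ has compact support, we can bound the expectations taken with respect to $Y_{i^*}$ and obtain a bound of the form

\[ |Gh(d, \mathbf{X}^{(d)}) - \tilde{G}h(d, \mathbf{X}^{(d)})| \le K \Bigl( \frac{\ell^3}{d^{\alpha/2}} + \frac{\ell^4}{d^{\alpha}} + \frac{\ell^5}{d^{3\alpha/2}} \Bigr) \bigl( 1 + ((\log f(X_{i^*}))')^2 \bigr) \]

for some constant $K > 0$. By assumption, $E[((\log f(X_{i^*}))')^2] < \infty$, so it follows that $E[|Gh(d, \mathbf{X}^{(d)}) - \tilde{G}h(d, \mathbf{X}^{(d)})|]$ converges to 0 as $d \to \infty$. $\square$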






6. Volatility and drift of the diffusion. 

6.1. Convergence to an equivalent volatility. 

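Lemma 8. We have

$$\lim_{d\to\infty} E_{\mathbf{X}^{(d)-}}\left[\left| E_{\mathbf{Y}^{(d)-}}\left[1 \wedge \exp\left\{\sum_{j=1,\, j\neq i^*}^{d} \varepsilon(d,X_j,Y_j)\right\}\right] - E_{\mathbf{Y}^{(d)-}}\left[1 \wedge \exp\{z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})\}\right] \right|\right] = 0,$$

where $\varepsilon(d,X_j,Y_j)$ is as in (10) and

$$z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-}) = \sum_{j=1,\, j\neq i^*}^{n} \varepsilon(d,X_j,Y_j) + \sum_{i=1}^{m} \sum_{j\in J(i,d),\, j\neq i^*} \frac{d}{dX_j}\log f(\theta_j(d)X_j)(Y_j - X_j) - \frac{\ell^2}{2d^{\alpha}} \sum_{i=1}^{m} \sum_{j\in J(i,d),\, j\neq i^*} \left(\frac{d}{dX_j}\log f(\theta_j(d)X_j)\right)^2. \qquad (12)$$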



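Proof. Using a Taylor expansion with three terms, we obtain

$$\begin{aligned}
E_{\mathbf{Y}^{(d)-}}&\left[1 \wedge \exp\left\{\sum_{j=1,\, j\neq i^*}^{d} \varepsilon(d,X_j,Y_j)\right\}\right] \\
&= E_{\mathbf{Y}^{(d)-}}\Bigg[1 \wedge \exp\Bigg\{\sum_{j=1,\, j\neq i^*}^{n} \varepsilon(d,X_j,Y_j) + \sum_{i=1}^{m} \sum_{j\in J(i,d),\, j\neq i^*} \Bigg[\frac{d}{dX_j}\log f(\theta_j(d)X_j)(Y_j - X_j) \\
&\qquad + \frac{1}{2}\frac{d^2}{dX_j^2}\log f(\theta_j(d)X_j)(Y_j - X_j)^2 + \frac{1}{6}\frac{d^3}{dU_j^3}\log f(\theta_j(d)U_j)(Y_j - X_j)^3\Bigg]\Bigg\}\Bigg] \qquad (13)
\end{aligned}$$

for some $U_j \in (X_j,Y_j)$ or $(Y_j,X_j)$.

By the triangle inequality, the Lipschitz property of the function $1 \wedge e^x$ (see Proposition 2.2 in [11]) and the observation that the first two terms of the function $z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})$ cancel the first two terms of the exponential function in (13), we get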



lAexpj ^d^^J^Yj)] 
• Eym- [1 A exp{z(d, Y^'^-) , X('^-))}] 



■ J(i,d) 



[\Wi{d,X 



(d)- 



rid)- 



J{i,d)' J(i,d)> 



+ J2c{J{i,d))fK 



i=l i=l 

By Lemma 6, the right-hand side converges in probability to 0 as $d \to \infty$. We then apply the bounded convergence theorem to complete the proof of the lemma. □

6.2. Simplified expression for the equivalent volatility. 

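Lemma 9. If condition (5) is satisfied, then

$$\lim_{d\to\infty} E_{\mathbf{X}^{(d)-}}\left[\left| E_{\mathbf{Y}^{(d)-}}\left[1 \wedge \exp\{z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})\}\right] - 2\Phi\left(-\frac{\ell\sqrt{E_R}}{2}\right) \right|\right] = 0,$$

where $z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})$ and $E_R$ are as in (12) and (6), respectively.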

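Proof. For each group of components whose scaling term appears infinitely often in the limit, that is, for $i = 1, \ldots, m$, let

$$R_i(d,\mathbf{x}^{(d)-}_{J(i,d)}) = \frac{1}{d^{\alpha}} \sum_{j\in J(i,d),\, j\neq i^*} \left(\frac{d}{dx_j}\log f(\theta_j(d)x_j)\right)^2. \qquad (14)$$

Since $(Y_j - X_j) \mid X_j \sim$ i.i.d. $N(0, \ell^2/d^{\alpha})$ for $j = 1, \ldots, d$, it follows that

$$z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-}) \mid \mathbf{Y}^{(n)-}, \mathbf{X}^{(d)-} \sim N\left(\sum_{j=1,\, j\neq i^*}^{n} \varepsilon(d,X_j,Y_j) - \frac{\ell^2}{2}\sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)}),\ \ell^2 \sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)})\right).$$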



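Applying Proposition 2.4 in [11] allows us to obtain an expression in terms of $\Phi(\cdot)$, the c.d.f. of a standard normal random variable,

$$\begin{aligned}
E_{\mathbf{Y}^{(d)-}}&\left[1 \wedge \exp\{z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})\}\right] \\
&= E_{\mathbf{Y}^{(n)-}}\Bigg[\Phi\left(\frac{\sum_{j=1,\, j\neq i^*}^{n} \varepsilon(d,X_j,Y_j) - \frac{\ell^2}{2}\sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)})}{\sqrt{\ell^2 \sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)})}}\right) \\
&\qquad + \exp\left\{\sum_{j=1,\, j\neq i^*}^{n} \varepsilon(d,X_j,Y_j)\right\} \times \Phi\left(\frac{-\sum_{j=1,\, j\neq i^*}^{n} \varepsilon(d,X_j,Y_j) - \frac{\ell^2}{2}\sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)})}{\sqrt{\ell^2 \sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)})}}\right)\Bigg].
\end{aligned}$$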



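We note that $E_R > 0$ since there is at least one $i \in \{1, \ldots, m\}$ such that $\lim_{d\to\infty} c(J(i,d))\, d^{\gamma_i}/d^{\alpha} > 0$. Using Propositions A.2 and A.3 and then applying Slutsky's theorem and the continuous mapping theorem, we conclude that $\exp(\sum_{j=1,\, j\neq i^*}^{n} \varepsilon(d,X_j,Y_j)) \to_p 1$ and

$$\frac{\pm\sum_{j=1,\, j\neq i^*}^{n} \varepsilon(d,X_j,Y_j) - \frac{\ell^2}{2}\sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)})}{\sqrt{\ell^2 \sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)})}} \to_p -\frac{\ell\sqrt{E_R}}{2}.$$

Since $E_{\mathbf{Y}^{(d-n)-}}[1 \wedge e^{z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})}]$ is positive and bounded by 1, we use the bounded convergence theorem to conclude that $E_{\mathbf{Y}^{(d)-}}[1 \wedge e^{z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})}] \to_p 2\Phi(-\ell\sqrt{E_R}/2)$; we complete the proof of the lemma by reapplying the bounded convergence theorem. □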
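As an informal numerical illustration of the Gaussian identity behind this proof, the following Python sketch compares the closed form $E[1 \wedge e^A] = \Phi(\mu/\sigma) + e^{\mu+\sigma^2/2}\Phi(-\sigma-\mu/\sigma)$ for $A \sim N(\mu,\sigma^2)$ with a direct quadrature, and checks that the choice $\mu = -\sigma^2/2$ gives $2\Phi(-\sigma/2)$. The function names, the quadrature settings and the values $\ell = 2.38$, $E_R = 1$ are assumptions made for the illustration only; they do not come from the paper.

```python
import math


def std_normal_cdf(x):
    """Standard normal c.d.f. Phi(x), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def closed_form(mu, sigma):
    """E[min(1, e^A)] for A ~ N(mu, sigma^2), written with Phi."""
    return (std_normal_cdf(mu / sigma)
            + math.exp(mu + 0.5 * sigma ** 2)
            * std_normal_cdf(-sigma - mu / sigma))


def by_quadrature(mu, sigma, n=200_000, width=10.0):
    """Midpoint-rule approximation of E[min(1, e^A)] over mu +/- width*sigma."""
    lower = mu - width * sigma
    step = 2.0 * width * sigma / n
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for k in range(n):
        a = lower + (k + 0.5) * step
        density = math.exp(-0.5 * ((a - mu) / sigma) ** 2) / norm
        total += min(1.0, math.exp(a)) * density * step
    return total


# Illustrative values (assumed): ell = 2.38 and E_R = 1, so sigma = ell.
sigma = 2.38
mu = -0.5 * sigma ** 2
exact = closed_form(mu, sigma)
approx = by_quadrature(mu, sigma)
limit = 2.0 * std_normal_cdf(-sigma / 2.0)
```

With these assumed values the three quantities should agree up to quadrature error, and `limit` is close to the familiar 0.234.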

6.3. Convergence to an equivalent drift. 



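Lemma 10. We have

$$\lim_{d\to\infty} E_{\mathbf{X}^{(d)-}}\Bigg[\Bigg| E_{\mathbf{Y}^{(d)-}}\left[\exp\left\{\sum_{j=1,\, j\neq i^*}^{d} \varepsilon(d,X_j,Y_j)\right\};\ \sum_{j=1,\, j\neq i^*}^{d} \varepsilon(d,X_j,Y_j) < 0\right] - E_{\mathbf{Y}^{(d)-}}\left[\exp\{z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})\};\ z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-}) < 0\right] \Bigg|\Bigg] = 0, \qquad (15)$$

where $\varepsilon(d,X_j,Y_j)$ and $z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})$ are as in (10) and (12), respectively.

Proof. First, let $T(x) = e^x \mathbf{1}_{(x<0)}$,

$$\Delta(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-}) = T\left(\sum_{j=1,\, j\neq i^*}^{d} \varepsilon(d,X_j,Y_j)\right) - T(z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-}))$$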



and 



6{d) = [ EV'')- [l^^('^>x5')r^)>Y^'&))l] + E<^(^'^)K'^ 



.1=1 



J(i,d) 



1=1 



.d^ 

^3a/2 



1/2 



We shall show that $\Delta(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-}) \mid \mathbf{X}^{(d)-} \to_p 0$ and then use this result to prove convergence of expectations.

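Similarly to the proof of Lemma A.7 in [10], we have

$$\begin{aligned}
P_{\mathbf{Y}^{(d)-}}&(|\Delta(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})| \ge \delta(d)) \\
&\le P_{\mathbf{Y}^{(d)-}}\left(\left|\sum_{j=1,\, j\neq i^*}^{d} \varepsilon(d,X_j,Y_j) - z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})\right| \ge \delta(d)\right) \\
&\quad + P_{\mathbf{Y}^{(d)-}}(-\delta(d) < z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-}) < \delta(d)). \qquad (16)
\end{aligned}$$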



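By Markov's inequality and the proof of Lemma 8, the first term on the right-hand side is bounded by

$$\frac{1}{\delta(d)} E_{\mathbf{Y}^{(d)-}}\left[\left|\sum_{j=1,\, j\neq i^*}^{d} \varepsilon(d,X_j,Y_j) - z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})\right|\right] \le \delta(d) \to_p 0$$

as $d \to \infty$. Using conditioning and the proof of Lemma 9, the second term on the right-hand side becomes

$$\begin{aligned}
P_{\mathbf{Y}^{(d)-}}&(|z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})| < \delta(d)) \\
&= E_{\mathbf{Y}^{(n)-}}\Bigg[\Phi\left(\frac{\delta(d) - \sum_{j=1,\, j\neq i^*}^{n} \varepsilon(d,X_j,Y_j) + \frac{\ell^2}{2}\sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)})}{\sqrt{\ell^2 \sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)})}}\right) \\
&\qquad - \Phi\left(\frac{-\delta(d) - \sum_{j=1,\, j\neq i^*}^{n} \varepsilon(d,X_j,Y_j) + \frac{\ell^2}{2}\sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)})}{\sqrt{\ell^2 \sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)})}}\right)\Bigg].
\end{aligned}$$

Using the convergence results developed in the proof of Lemma 9, along with the fact that $\delta(d) \to_p 0$ as $d \to \infty$ and the bounded convergence theorem, we deduce that the previous expression converges in probability to 0. Therefore, $\Delta(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-}) \mid \mathbf{X}^{(d)-} \to_p 0$ and (15) follows by reapplying the bounded convergence theorem twice. □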
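The truncated exponential $T(x) = e^x \mathbf{1}_{(x<0)}$ used in this proof has a closed-form Gaussian expectation: for $A \sim N(\mu,\sigma^2)$, a standard calculation gives $E[T(A)] = e^{\mu+\sigma^2/2}\Phi(-\sigma-\mu/\sigma)$, which reduces to $\Phi(-\sigma/2)$ when $\mu = -\sigma^2/2$. The Python sketch below checks this by Monte Carlo. The function names, the sample size, the seed and the value $\sigma = 2.38$ are assumptions chosen for the illustration, not quantities taken from the paper.

```python
import math
import random


def std_normal_cdf(x):
    """Standard normal c.d.f. Phi(x), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def truncated_exp_closed_form(mu, sigma):
    """E[e^A; A < 0] for A ~ N(mu, sigma^2), written with Phi."""
    return math.exp(mu + 0.5 * sigma ** 2) * std_normal_cdf(-sigma - mu / sigma)


def truncated_exp_monte_carlo(mu, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[T(A)] with T(x) = e^x * 1(x < 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        a = rng.gauss(mu, sigma)
        if a < 0.0:
            total += math.exp(a)
    return total / n_samples


# Illustrative values (assumed): sigma plays the role of ell * sqrt(E_R).
sigma = 2.38
mu = -0.5 * sigma ** 2
drift_exact = truncated_exp_closed_form(mu, sigma)
drift_mc = truncated_exp_monte_carlo(mu, sigma)
```

Under these assumptions `drift_exact` coincides with $\Phi(-\sigma/2)$, that is, with half of the corresponding value of $E[1 \wedge e^A]$.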






6.4. Simplified expression for the equivalent drift. 



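Lemma 11. If condition (5) is satisfied, then

$$\lim_{d\to\infty} E_{\mathbf{X}^{(d)-}}\left[\left| E_{\mathbf{Y}^{(d)-}}\left[\exp\{z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})\};\ z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-}) < 0\right] - \Phi\left(-\frac{\ell\sqrt{E_R}}{2}\right) \right|\right] = 0,$$

where the functions $\varepsilon(d,X_j,Y_j)$ and $z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})$ are as in (10) and (12), respectively.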

Proof. The proof is similar to that of Lemma 9, the only difference lying in the fact that (Proposition 2.4 in [11])

$$E_{\mathbf{Y}^{(d)-}}\left[\exp\{z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-})\};\ z(d,\mathbf{Y}^{(d)-},\mathbf{X}^{(d)-}) < 0\right] = E_{\mathbf{Y}^{(n)-}}\left[\exp\left\{\sum_{j=1,\, j\neq i^*}^{n} \varepsilon(d,X_j,Y_j)\right\} \Phi\left(\frac{-\sum_{j=1,\, j\neq i^*}^{n} \varepsilon(d,X_j,Y_j) - \frac{\ell^2}{2}\sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)})}{\sqrt{\ell^2 \sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)})}}\right)\right]. \qquad \Box$$

7. Discussion. The theorems in this paper extend the i.i.d. work of Roberts, Gelman and Gilks [11] to a more general setting where the scaling term of each target component is allowed to depend on the dimension of the target distribution. The conclusions achieved are similar to those in [11], since the AOARs are identical; the sole difference lies in the optimal scaling values themselves. Condition (5), which says that no target component converges significantly faster than the others, ensures that the process behaves asymptotically as in the i.i.d. case. This work thus partially answers Open Problem #3 of [14].

These results can also be used to determine, for virtually any correlated multivariate normal target distribution, whether or not 0.234 is optimal. Contrary to what seemed to be a common belief, multivariate normal distributions do not always adopt a conventional limiting behavior and there exist cases where the AOAR is significantly smaller than 0.234 (see [1]).

It was shown in the i.i.d. case that although asymptotic, the results are fairly accurate in small dimensions ($d \ge 10$). In the present case, however, this fact is not always verified and care must be exercised in practice. In particular, if there exists a finite number of scaling terms such that $\lambda_j$ is close to $\alpha$ [but with $\lambda_j < \alpha$, otherwise condition (5) would be violated], then the optimal acceptance rate converges extremely slowly to 0.234 from above. For instance, suppose that $\Theta^{-2}(d) = (d^{-\lambda}, 1, \ldots, 1)$ with $\lambda < 1$. The proposal scaling is then $\sigma^2(d) = \ell^2/d$ and the closer $\lambda$ is to 1, the slower the convergence of the optimal acceptance rate is to 0.234. In fact, for a multivariate normal target with $\lambda = 0.75$, simulations show that $d$ must be as large as 200,000 for the optimal acceptance rate to be reasonably close to 0.234; they also show that for $\alpha - \lambda > 0.5$, the asymptotic results are accurate in relatively small dimensions, just as in the i.i.d. case. Detailed examples and simulation studies illustrating the results introduced in this paper and in [1] are presented in [2].
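For readers who wish to experiment with the kind of simulation mentioned above, the following Python sketch estimates the acceptance rate of a random walk Metropolis algorithm with proposal variance $\ell^2/d$ on a target with independent normal components. It is a minimal illustration under stated assumptions and not the simulation code of [2]: the function name, the normal target, the chain length, the seed and the choice $\ell = 2.38$ (the value usually quoted for i.i.d. standard normal components) are all assumed here.

```python
import math
import random


def rwm_acceptance_rate(variances, ell, n_iter=4000, seed=0):
    """Estimate the acceptance rate of random walk Metropolis.

    The target has independent N(0, variances[j]) components and the
    proposal is N(x, (ell^2 / d) I), where d = len(variances).
    """
    rng = random.Random(seed)
    d = len(variances)
    step_sd = ell / math.sqrt(d)

    def log_target(point):
        return -0.5 * sum(v * v / s for v, s in zip(point, variances))

    # Start in stationarity so that no burn-in is needed.
    x = [rng.gauss(0.0, math.sqrt(s)) for s in variances]
    log_pi_x = log_target(x)
    accepted = 0
    for _ in range(n_iter):
        y = [v + step_sd * rng.gauss(0.0, 1.0) for v in x]
        log_pi_y = log_target(y)
        # Accept with probability min(1, pi(y) / pi(x)).
        if math.log(1.0 - rng.random()) < log_pi_y - log_pi_x:
            x, log_pi_x = y, log_pi_y
            accepted += 1
    return accepted / n_iter


# i.i.d. case (all scaling terms equal to 1) in dimension d = 50.
d = 50
rate_iid = rwm_acceptance_rate([1.0] * d, ell=2.38)
```

A target of the form discussed above can be passed as `[d ** (-lam)] + [1.0] * (d - 1)` for a hypothetical `lam` below 1; note that in that case the value of $\ell$ would itself have to be tuned, since 2.38 is only the i.i.d. choice.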

APPENDIX 

Lemma A.1. Let $f$ be a $C^2$ probability density function (p.d.f.). If $(\log f(x))'$ is Lipschitz continuous, then $f'(x) \to 0$ as $x \to \pm\infty$.

Proof. The asymptotic behavior of a p.d.f. as $x \to \pm\infty$ can be one of three things: (1) $f(x) \to 0$, $f'(x) \to 0$; (2) $f(x) \to 0$, $f'(x) \not\to 0$; (3) $f(x) \not\to 0$, $f'(x) \not\to 0$. We prove that in cases (2) and (3), $(\log f(x))'$ is not Lipschitz continuous, which implies that (1) is the only possible option.

(2) $f(x) \to 0$, $f'(x) \not\to 0$: Since $f \to 0$, it follows that $\forall \epsilon > 0$, $\exists x_0(\epsilon) \in \mathbf{R}$ such that $\forall x > x_0(\epsilon)$, $f(x) < \epsilon$. Since $f' \not\to 0$, it follows that $\forall \epsilon > 0$, $\exists x^* > x_0(\epsilon) + 1$ such that $|f'(x^*)| > \limsup |f'|/2$. Because $f$ is $C^2$, we have $\forall\, 0 < \epsilon < \limsup |f'|/2$, $\exists y$ with $|x^* - y| < 1$ such that $|f'(y)| = \epsilon$. Now, choose $y^*$ to be the value of $y$ which minimizes $|x^* - y|$, but such that $f(y^*) > f(x^*)$. Given $0 < \epsilon < \limsup |f'|/2$, we then have

$$\sup_{x,y \in \mathbf{R}} \frac{|f'(x)/f(x) - f'(y)/f(y)|}{|x - y|} \ge \frac{\big||f'(x^*)|/f(x^*) - |f'(y^*)|/f(y^*)\big|}{|x^* - y^*|} \ge \frac{\limsup |f'|/2 - \epsilon}{\epsilon}.$$

Since this is true for all $0 < \epsilon < \limsup |f'|/2$, the Lipschitz continuity assumption is violated.

(3) $f(x) \not\to 0$, $f'(x) \not\to 0$: Since $f$ is continuous, positive and $\int f = 1$, it follows that $\forall \epsilon > 0$, $\exists x_0(\epsilon) \in \mathbf{R}$ such that $f(x) < \epsilon$ for $x > x_0(\epsilon)$, except on a set $A_\epsilon$ of Lebesgue measure $\lambda(A_\epsilon) < \epsilon$. Since $(-\infty, \epsilon)$ is an open set, it follows that $B = \{x \in \mathbf{R} : f(x) < \epsilon\}$ must also be open; $A_\epsilon = B^c \cap [x_0(\epsilon), \infty)$ is then formed from closed intervals over which $f(x) \ge \epsilon$.

Since $f \not\to 0$, it follows that $\forall \epsilon > 0$, there exists an interval $[x(\epsilon), y(\epsilon)]$ in $A_\epsilon$ where the maximum value reached by $f$ over this interval ($h(\epsilon)$ say) is such that $h(\epsilon) > \limsup |f|/2$. There might be many values in the interval for which $f(x) = h(\epsilon)$, but all of these values will satisfy $f'(x) = 0$. Since $f(x(\epsilon)) = f(y(\epsilon)) = \epsilon$, it follows that $\sup_{x \in \mathbf{R}} f'(x) \ge \frac{h(\epsilon) - \epsilon}{y(\epsilon) - x(\epsilon)} \ge \frac{\limsup |f|/2 - \epsilon}{\epsilon}$. Hence, $\sup_{x \in \mathbf{R}} f'(x) \ge \frac{\limsup |f|}{2\epsilon} - 1$ and since this is true $\forall \epsilon > 0$, we have $\sup_{x \in \mathbf{R}} f'(x) = \infty$. Given $\epsilon > 0$, we take $y$ to be one of the points in $[x(\epsilon), y(\epsilon)]$ such that $f(y) = h(\epsilon)$ and $f'(y) = 0$. We then have

$$\sup_{x,y \in \mathbf{R}} \frac{|f'(x)/f(x) - f'(y)/f(y)|}{|x - y|} \ge \sup_{x \in \mathbf{R}} \frac{|f'(x)/f(x) - 0|}{|x - y|} \ge \sup_{x \in \mathbf{R}} \frac{|f'(x)/f(x) - 0|}{\epsilon} = \infty$$

and we see that the Lipschitz continuity assumption is violated. Note that in cases (2) and (3), we have considered the case where $x \to \infty$; we can construct a similar argument for the case where $x \to -\infty$. □

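Proposition A.2. Let $\varepsilon(d,X_j,Y_j)$, $j = 1, \ldots, n$, be as in (10). If $\lambda_j < \alpha$, then $\varepsilon(d,X_j,Y_j) \to_p 0$.

Proof. By Taylor's theorem, we have for some $U_j \in (X_j,Y_j)$ or $(Y_j,X_j)$

$$E[|\varepsilon(d,X_j,Y_j)|] = E\left[\left|(\log f(\theta_j(d)X_j))'(Y_j - X_j) + \frac{1}{2}(\log f(\theta_j(d)X_j))''(Y_j - X_j)^2 + \frac{1}{6}(\log f(\theta_j(d)U_j))'''(Y_j - X_j)^3\right|\right].$$

Applying changes of variable and using the fact that $|(\log f(X))''|$ and $|(\log f(U))'''|$ are bounded by a constant, we obtain, for some $K > 0$,

$$E[|\varepsilon(d,X_j,Y_j)|] \le \ell \frac{d^{\lambda_j/2}}{d^{\alpha/2}} K E[|(\log f(X))'|] + \left(\ell^2 \frac{d^{\lambda_j}}{d^{\alpha}} + \ell^3 \frac{d^{3\lambda_j/2}}{d^{3\alpha/2}}\right) K.$$

By assumption, $E[|(\log f(X))'|]$ is bounded by some finite constant. Since $\lambda_j < \alpha$, the previous expression converges to 0 as $d \to \infty$. To complete the proof of the proposition, we use Markov's inequality and find that for all $\epsilon > 0$, $P(|\varepsilon(d,X_j,Y_j)| \ge \epsilon) \le E[|\varepsilon(d,X_j,Y_j)|]/\epsilon \to 0$ as $d \to \infty$. □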
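The statement of Proposition A.2 can be illustrated numerically in a simple special case. The Python sketch below assumes that $f$ is the standard normal density, that $\theta_j^{-2}(d) = d^{-\lambda_j}$ with the constant set to 1, and that the proposal increment is $N(0, \ell^2/d^{\alpha})$, so that $\varepsilon(d,X_j,Y_j) = -\theta_j^2(d)(Y_j^2 - X_j^2)/2$. The function name, the parameter values ($\lambda_j = 0.5$, $\alpha = 1$, $\ell = 2.38$), the sample size and the seed are assumptions made for the illustration only.

```python
import math
import random


def mean_abs_epsilon(d, lam=0.5, alpha=1.0, ell=2.38, n_samples=20_000, seed=1):
    """Monte Carlo estimate of E|epsilon(d, X_j, Y_j)| for a normal f.

    Assumes theta_j(d)^2 = d^lam, so that theta_j(d) * X_j ~ N(0, 1),
    and Y_j - X_j ~ N(0, ell^2 / d^alpha).
    """
    rng = random.Random(seed)
    theta = d ** (lam / 2.0)
    step_sd = ell / d ** (alpha / 2.0)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0) / theta
        y = x + step_sd * rng.gauss(0.0, 1.0)
        # log f(theta * y) - log f(theta * x) for the standard normal f.
        eps = -0.5 * theta ** 2 * (y * y - x * x)
        total += abs(eps)
    return total / n_samples


# Estimates for increasing dimension; lam < alpha, so they should shrink.
estimates = [mean_abs_epsilon(d) for d in (10 ** 2, 10 ** 4, 10 ** 6)]
```

In this assumed setting the estimates should decrease roughly like $d^{(\lambda_j - \alpha)/2}$, in line with the leading term of the bound in the proof.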

Proposition A.3. Let $R_i(d,\mathbf{X}^{(d)-}_{J(i,d)})$ be as in (14), with $i \in \{1, \ldots, m\}$. We have $\sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)}) \to_p E_R$, where $E_R$ is as in (6).

Proof. The expectation of each variable satisfies E[i?j(d, X^|^ = 
c{j{^^d)) E[( '^^^^.j^ )^]. By independence between the Xj^s and the fact 






that Var(X) < E[X^], we obtain 



1 



■c{J{i,d))E 



i=l 



\f{X)) 



By assumption, $E[(f'(X)/f(X))^4]$ is finite and since $c(J(i,d))\, d^{\gamma_i} \le d^{\alpha}$, the variance converges to 0 as $d \to \infty$. To conclude the proof, we use Chebychev's inequality and find that $\forall \epsilon > 0$, $P(|\sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)}) - E_R| \ge \epsilon) \le \mathrm{Var}(\sum_{i=1}^{m} R_i(d,\mathbf{X}^{(d)-}_{J(i,d)}))/\epsilon^2 \to 0$ as $d \to \infty$. □